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ABSTRACT 

A  robust  version  of  Principal  Component  Analysis  (PCA)  can 
be  constructed  via  a  decomposition  of  a  data  matrix  into  low 
rank  and  sparse  components,  the  former  representing  a  low¬ 
dimensional  linear  model  of  the  data,  and  the  latter  repre¬ 
senting  sparse  deviations  from  the  low-dimensional  subspace. 
This  decomposition  has  been  shown  to  be  highly  effective,  but 
the  underlying  model  is  not  appropriate  when  the  data  are  not 
modeled  well  by  a  single  low-dimensional  subspace.  We  con¬ 
struct  a  new  decomposition  corresponding  to  a  more  general 
underlying  model  consisting  of  a  union  of  low-dimensional 
subspaces,  and  demonstrate  the  performance  on  a  video  back¬ 
ground  removal  problem. 

Index  Terms —  Compressive  Sensing,  Robust  Princi¬ 
pal  Component  Analysis,  Low  Rank,  Sparse  Representation, 
Group  Sparse 

1.  INTRODUCTION 

Matrix  completion ,  which  attempts  to  reconstruct  a  matrix 
with  only  a  small  fraction  of  its  entries  known  [1],  is  a  recent 
branch  of  the  field  of  compressive  sensing.  (The  assumption 
that  the  matrix  has  a  low  rank  plays  a  role  analogous  to  that  of 
sparsity  in  compressive  sensing.)  An  extension  of  this  prob¬ 
lem  seeks  to  decompose  a  matrix  D  of  high-dimensional  data 
into  a  sum  of  two  components,  one  having  low  rank,  the  other 
being  sparse.  This  can  be  expressed  as  the  optimization 

minrank(L)  +  A||5||o  such  that  L  +  S  =  D  ,  (1) 

L,S 

where  ||  •  ||0  counts  the  number  of  nonzero  entries,  and  A  >  0 
is  a  tuning  parameter.  We  can  regard  L  as  a  low-dimensional 
description  of  the  data,  while  S  consists  of  deviations  from 
that  model,  which  can  be  interesting  in  their  own  right. 

We  can  compare  (1)  to  Principal  Component  Analysis 
(PCA),  which  would  compute  the  matrix  L  of  desired  rank 
that  minimizes  || D  —  L ||2,  the  entry-wise  Euclidean  norm 
of  the  residual.  Because  the  second  term  of  (1)  penalizes 
only  the  number  of  deviations  and  not  their  size,  the  low¬ 
dimensional  model  L  will  not  be  perturbed  by  outliers  among 
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the  entries  of  D ,  and  hence  will  provide  a  more  robust  de¬ 
scription  of  most  of  the  dataset.  This  connection  between 
sparse  optimization  and  “robust  PCA”  was  made  by  Candes 
et  al.  [2],  who  also  provided  a  tractable,  convex  approxima¬ 
tion,  which  they  called  Principal  Component  Pursuit,  of  the 
NP-hard  problem  (1) 

min  ||<j(L) ||i  +  A||5||i  such  that  L  +  S  =  D  ,  (2) 

where  D  is  m  x  n  and  A  =  1  / ^/ma x{m,  n}.  The  first  term 
is  the  l1  norm  of  the  vector  a(L)  of  singular  values  of  L,  and 
is  known  as  the  nuclear  norm  of  L.  Applications  considered 
thus  far  include  automated  background  removal  in  video  [3], 
text  analysis  [4],  and  image  alignment  [5]. 

This  decomposition  approach  assumes  that  there  is  a  sin¬ 
gle,  low-dimensional  model  that  describes  most  components 
of  the  elements  of  the  dataset.  In  this  work,  we  develop  a 
more  general  method  that  is  suitable,  for  instance,  for  data  de¬ 
scribed  by  a  manifold  [6],  except  for  a  sparse  set  of  possibly- 
large  deviations.  We  will  thus  allow  our  low-dimensional  de¬ 
scription  to  vary  across  the  dataset,  while  retaining  the  robust¬ 
ness  given  by  having  a  second,  sparse  component.  In  the  con¬ 
text  of  video  background  removal,  this  will  allow  us  to  handle 
the  case  of  a  moving  camera,  making  the  method  suitable  for 
a  much  larger  class  of  surveillance  problems. 

2.  LOCAL  PRINCIPAL  COMPONENT  PURSUIT 

The  geometric  intuition  motivating  our  approach  is  that  if  the 
data  lie  within  a  nonlinear  manifold,  then  every  sample  in  the 
manifold  may  be  represented  (assuming  adequate  sampling 
density)  as  a  sparse  linear  combination  of  neighboring  sam¬ 
ples  spanning  an  approximation  to  the  local  tangent  plane. 
This  idea  can  be  implemented  as  the  problem 

min«||C/||i+^||U||2,i  +  ||5||i  such  that  DU  +  S  =  D  ,  (3) 

in  which  the  explicit  notion  of  low  rank,  and  its  nuclear- 
norm  proxy,  is  replaced  by  representability  of  a  matrix  as 
a  sparse  representation  on  itself.  (The  subspace  segmen¬ 
tation  algorithm  of  Liu  et  al  [7]  also  employs  the  concept 
of  self-representability,  but  combines  it  with  a  nuclear- 
norm  proxy  for  rank,  as  in  (2).)  The  2, 1-norm,  defined  as 
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II U\\ 2,1  =  Ysi  yj E j  uij  »  encourages  rows  of  U  to  be  zero, 
but  does  not  discourage  nonzero  values  among  the  entries  of 
a  nonzero  row  [8].  This  takes  advantage  of  the  group -sparsity 
structure  that  can  arise  when  points  of  the  dataset  are  near 
to  each  other.  It  also  plays  the  vital  role  of  penalizing  away 
degenerate  solutions  in  which  U  is  approximately  the  identity 
matrix,  which  could  arise  if  there  were  only  a  1-norm  penalty 
on  U.  The  \\U\\i  term  is  included,  however,  since  we  usu¬ 
ally  (an  exception  would  be  when  the  data  lie  within  a  single 
low -rank  subspace)  also  wish  to  encourage  zero  values  within 
each  nonzero  row  of  U. 

To  better  handle  noisy  data,  we  replace  (3)  with  a  pe¬ 
nalized  form,  and  add  a  Total  Variation  (TV)  penalty  on  the 
sparse  deviations  (for  cases  when  we  expect  these  deviations 
to  form  contiguous  regions),  giving  the  problem 

rnm^WAU  +  S-DWl+aWUh 

+  m2,i+7l|S||i  +  5||VS||1)  (4) 

where  the  dictionary  A  is  derived  from  the  data  D  ( e.g .  by 
mean- subtraction  and  scaling),  and  V5  is  a  vector- valued  dis¬ 
cretization  of  the  3-D  gradient  of  S,  interpreted  as  a  data  cube. 

Eq.  (4)  can  be  solved  efficiently  using  the  Split  Bregman 
method  [9].  We  introduce  variables  P,  Q,  and  R,  which  are 
auxiliary  versions  of  U,  S,  and  VS,  respectively.  We  add 
terms  relaxing  the  equality  constraints  of  each  quantity  and  its 
auxiliary  variable,  and  in  order  to  enforce  equality  at  conver¬ 
gence,  we  introduce  Bregman  variables  Bp,  Bq,  and  Br  [9]: 

mm  R \\\AU  +  S-  D\\l  +  aWPh  +  0||P||2>1 

+  7\\Q\\1+5\\R\\1  +  ±\\P-U-Bp\\l 

+  ^\\Q-S-  Bq\\l  +  ^\\R- VS  -  Br\\l  .  (5) 

This  allows  the  problem  to  be  split  into  an  alternating  mini¬ 
mization  of  the  following  subproblems: 

mm  l-\\AU  —  (D  —  S)|||  +  ^\\U  -  (P  -  Bp)\\l  ,  (6) 

mmi||S  -  (D  -  AU)\\l  +  f  ||  S  -  (Q  -  Bq)f2 


+  |||V5—  (R-  Br)\\l,  (7) 

min^llP-^  +  ^lll  +  allPllx+^IIPIb,!  ,  (8) 

mm^HQ  —  (S  +  Pg)||2  +  7IIQII1  1  and  (9) 

min^IlP-  (VS  +  Pr)|||  -MUPUi  .  (10) 

R  Z 


Subproblems  (6)  and  (7)  are  simple  l2  problems,  and  can 
be  solved  by  standard  techniques  for  solving  linear  systems 


(e.g.,  conjugate  gradient).  The  other  three  subproblems  can 
be  solved  very  cheaply  using  shrinkage.  Subproblems  (9) 
and  (10)  use  standard  shrinkage,  also  known  as  soft  thresh¬ 
olding: 

shrink(T,  Q  =  sign(T)  max{0,  |Tj  —  £}  ,  (11) 


where  the  operations  are  to  be  understood  entry  wise.  Sub¬ 
problem  (8),  which  contains  both  the  1-  and  2, 1-norm,  uses  a 
generalized  shrinkage,  defined  row-wise  by 


shrink2?i(T,  p)1 


shrink  (T%  £) 

1  +  77 /  shrink( ||  shrink(T%  £)||2, 77) 


(12) 

with  the  convention  that  1/(1  +  77/0)  =  0.  The  algorithm 
consists  of  iteratively  solving  the  main  variables  and  updating 
the  Bregman  variables  as  follows: 


P(fc+1)  =(AtA  +  XI)-1  ( At(D  -  S(k))  +  A (P(fe)  -  B^))  , 
S(k+i)  ^((i /U)/  +  ^VTV)_1((P  -  AU{k+l)) 

+  /K(<5(fc)  -  B<*))  +  ^VT(P(fc)  -  B '<*>))  , 
p{k+i)  =shrink2,i (U(k+1)  +  BW,  a/ \,P/\)  , 
q(k+i)  =shrink(S(A:+1)  +Pf},7/A()  , 

R(k+i)  =  shrink( VP(fc+1)  +B(rk\6/v)  , 


_g(fc+i)  — b (k)  _|_  _  p(^+t)  ^ 

B(k+1)  =B(k)  +  g(k+ 1)  _  Q(k+ 1)  ^  and 

B(k+1)  =B(k)  +  VS<(fc+ 1)  _  R(k+ 1)  _ 


We  initialize  all  of  these  variables  with  zero  vectors,  but  con¬ 
vergence  is  not  expected  to  be  dependent  on  this  choice. 


3.  ADAPTIVE,  OUTLIER-REMOVED  DICTIONARY 

In  (4),  an  appropriate  sparse  U  can  be  viewed  as  generating  a 
locally  low-dimensional  approximation  AU  of  D  —  S.  When 
the  dictionary  is  simply  the  data  (i.e.,  A  =  D ),  the  sparse 
deviations  (or  outliers)  S  are  also  the  deviations  of  the  dictio¬ 
nary  A,  so  constructing  the  locally  low-dimensional  approxi¬ 
mation  as  (A  —  S)U,  implying  an  adaptive  dictionary  A  —  S, 
should  allow  U  to  be  even  sparser.  This  gives  the  modified 
problem 

mini||(^  -S^  +  S-Dg  +  aWUh 

+  /?||£/||2,i+7ll%  +  <5||V%  .  (13) 

This  problem  can  be  minimized  as  before,  the  only  changes 
being  to  the  subproblems  for  U  and  S : 

mini||(^  -  S)U  -{D-  5)|||  +  3|| U  -  (P  -  Bp) |||  , 

mmi||S(7  -U)-(D-  AU)\\\  +  || \V  -  (Q  -  Bq)\\l 

+  ^VS-(R-Br)g, 
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with  solutions  given  by  the  linear  systems 

((A-  S)t(A-  S)+XI)U 

=  {A  -  S)t(D  -S)  +  X(P  -  Bp)  , 
5(7  -U){I-  U)T  +  {nl  +  jA7tV)5 
=  (D-  AU)(I  -  U)T  +  n{Q  -  Bq)  +  vVt(R  -  Br)  . 

This  adaptive-dictionary  approach  is  still  possible  when 
A  D,  depending  on  how  A  is  derived  from  D ,  but  the 
resulting  equations  for  U  and  S  are  a  little  more  complicated. 

4.  RESULTS 

We  test  our  algorithm  on  the  video  background  removal  prob¬ 
lem  addressed  by  Wright  et  al.  [3],  using  a  288-frame  traffic 
video  sequence  from  the  Lankershim  Boulevard  Dataset  [10, 
camera  4,  8:45-9:00  AM].  (This  problem  provides  a  conve¬ 
nient  comparison  between  these  two  general  data  decomposi¬ 
tion  techniques,  but  while  the  performance  of  our  method  is 
subjectively  quite  good,  we  do  not  claim  that  it  is  competi¬ 
tive  when  compared  with  application- specific  algorithms  for 
this  problem.)  We  use  the  modified-dictionary  form  (13)  of 
the  algorithm  since  it  gives  better  results.  A  and  D  were  both 
constructed  from  the  data  by  subtracting  the  mean  from  each 
column  and  scaling  so  that  the  maximum  value  was  1 . 

The  first  test  sequence  is  a  reduced-resolution  (240  x  320 
pixel  frames)  version  of  the  data,  with  each  frame  of  the  video 
being  a  column  of  D ,  giving  a  76800  x  288  matrix.  Because 
the  traffic  camera  is  stationary,  this  dataset  is  well-modeled  by 
a  single  low-dimensional  subspace.  Our  algorithm  gives  a  de¬ 
composition  (see  Fig.  1)  that  is  visually  almost  indistinguish¬ 
able  from  the  result  (omitted  here  due  to  space  constraints) 
obtained  by  solving  (2),  using  the  algorithm  of  [1 1]. 

Our  second  test  sequence  simulates  a  panning  camera  by 
taking  a  moving  240  x  320  pixel  cropping  window  within  the 
original  sequence.  This  window  moves  slowly  to  the  left,  and 
then  back  to  the  original  position,  at  a  rate  of  1/4  pixel/frame. 
In  this  case  the  background  is  poorly  approximated  by  any 
single  low-dimensional  subspace,  but  since  the  background 
motion  is  slow  with  respect  to  the  foreground  motion,  a  lo¬ 
cally  low-dimensional  model  provides  a  much  better  approx¬ 
imation.  A  comparison  of  a  single  frame  of  the  sparse  com¬ 
ponents  of  different  methods  applied  to  this  data  is  provided 
in  Fig.  2.  The  “local”  sparse  components  computed  using  our 
algorithm  clearly  have  far  less  residual  background  than  the 
“global”  sparse  component  resulting  from  (2). 

5.  CONCLUSION 

We  have  proposed  a  new  decomposition,  together  with  a  Split 
Bregman  type  algorithm,  for  high- dimensional  data,  general¬ 
izing  the  Robust  PCA  of  Candes  et  al.  [2]  to  certain  nonlinear 
data.  The  ability  of  this  generalization  to  model  data  that  does 


not  conform  to  the  globally  low- dimensional  restriction  has 
been  demonstrated  on  the  video  background  removal  prob¬ 
lem.  Future  work  will  include  development  of  automatic  pa¬ 
rameter  selection  methods,  and  application  of  the  decompo¬ 
sition  to  additional  problems  in  which  the  relaxed  constraints 
on  the  data  can  be  expected  to  provide  an  advantage. 
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(a)  Original  data  (b)  Local  low  rank  (c)  Local  sparse 


Fig.  1.  Results  for  frame  166  from  the  stationary  test  video  sequence.  The  decomposition  was  computed  using  our  algorithm 
with  parameters  a  =  1.0  x  10_5,/3  =  1.0  x  10_2,7  =  3.0  x  10-5,  and  S  =  1.0  x  10-4. 


(c)  Local  sparse  (d)  No-TV  local  sparse 


Fig.  2.  Results  for  frame  166  from  the  slowly-panning  test  video  sequence.  The  global  sparse  component  (b)  is  obtained 
using  decomposition  (2)  (as  in  [3]),  and  the  local  sparse  component  (c)  is  generated  by  our  algorithm  with  parameters  a  = 
4.0  x  10-3,  (3  =  8.0  x  10-2, 7  =  5.0  x  10-4,  and  5  =  3.0  x  10-4.  Component  (d)  is  generated  in  the  same  was  as  (c),  except 
that  S  =  0,  so  that  there  is  no  TV  regularization.  This  example  demonstrates  that  the  performance  advantage  of  our  algorithm 
is  primarily  due  to  the  local-linear  model,  and  not  the  TV  regularization. 
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